Database Reformulation with Integrity Constraints 

(extended abstract) 
Abstract 

In this paper we study the problem of reducing the 
evaluation costs of queries on finite databases in pres- 
ence of integrity constraints, by designing and ma- 
terializing views. Given a database schema, a set 
of queries defined on the schema, a set of integrity 
constraints, and a storage limit, to find a solution 
to this problem means to find a set of views that 
satisfies the storage limit, provides equivalent rewrit- 
ings of the queries under the constraints (this require- 
ment is weaker than equivalence in the absence of 
constraints), and reduces the total costs of evaluat- 
ing the queries. This problem, database reformula- 
tion, is important for many applications, including 
data warehousing and query optimization. We give 
complexity results and algorithms for database refor- 
mulation in presence of constraints, for conjunctive 
queries, views, and rewritings and for several types 
of constraints, including functional and inclusion de- 
pendencies. To obtain better complexity results, we 
introduce an unchase technique, which reduces the 
problem of query equivalence under constraints to 
equivalence in the absence of constraints without in- 
creasing query size. 

1 Introduction 

In many contexts it is beneficial to answer 
database queries using derived data called views. 
A view is a named query, which can be stored in a 
database system as a definition (virtual view) or 
as an answer to the query (materialized view). A 
user query can be answered using views via a new 
definition that is called a rewriting and is built 
in terms of the views. Using virtual or materialized 
views in query answering [LMSS95] is relevant 
in applications in information integration, 
data warehousing, web-site design, and query op- 
timization. Two main directions in answering 
queries using views are (1) feasibility: to obtain 
some answer to a given query using given views, 
as in the information-integration scenario, and 
(2) efficiency: to reduce query-execution time by 
using the views, as in the query-optimization sce- 
nario. Within the efficiency direction, which is 
our focus in this paper, the objective is typically 
to use views to obtain equivalent query rewrit- 
ings — that is, definitions that give the exact 
answer to the query on all databases. Answering 
queries using views has been explored in depth 
for relational database systems [Kan90] and for 
conjunctive queries, which can be defined via 
positive existential conjunctive formulas of first-order 
logic [End72]; for a survey of methods for 
answering queries using views see [Hal01]. 

In the past few years, significant research ef- 
forts have been concentrated on view selection, 
that is, on developing methods for defining and 
precomputing materialized views to answer pre- 
defined queries; existing approaches differ in 
their main objective (feasibility or efficiency) and 
in how they explore the search space of views 
and rewritings for the given queries, typically 
on finite databases. [CG00] introduced the approach 
of database reformulation, with an emphasis 
on efficiency and on complete exploration 
of the search space of efficient rewritings. For- 
mally, starting with a set of finite database re- 
lations and a set of queries, the problem is to 
design a set of views of the database relations 
that (1) can be materialized under a given re- 
striction (such as a storage limit, i.e., the amount 
of disk space available for storing the view re- 
lations) and, once materialized, (2) can be used 
by a given evaluation algorithm in answering the 
queries equivalently and more efficiently than the 
original relations. The schema consisting of the 
materialized views is called a reformulation of 
the problem input. A reformulation is beneficial 
(or optimal) if it is as efficient as or more efficient 
than the original (or every other) (re)formulation 
on all given queries and all databases consistent 
with the given schema. It has been shown [CG00] 
that there are reformulation problems for which 
there are infinitely many beneficial reformula- 
tions; at the same time, only finitely many of 
these reformulations need to be considered since 
any other reformulation is either larger or less 
efficient to use. Therefore, it is possible to find 
an optimal reformulation in finite time. 

The results in [CG00] do not take into account 
integrity constraints, or dependencies, on 
the base relations in the database. Dependencies 
are semantically meaningful and syntactically re- 
stricted sentences of the predicate calculus that 
must be satisfied by any "legal" database; examples 
include functional dependencies and foreign-key 
constraints [Kan90, AHV95]. The presence 
of dependencies can increase the set of benefi- 
cial reformulations of a database. Consider an 
example: 

EXAMPLE 1.1 Let a query Q be defined on 
a database with schema {S(A,B), T(C,D)} as 
q(X,Y) :- s(X,Y), s(X,a), t(Y,a). 
Consider a view V, 
v(X,W) :- s(X,a), t(a,W). 
Query Q, but not view V, has self-joins; that is, 
the definition of Q, but not that of V, has multiple 
literals with the same relation name. It can 
be shown [LMSS95] that in the absence of dependencies, 
V cannot be used to equivalently rewrite 
Q. At the same time, suppose the database satisfies 
a functional dependency $\sigma$, 

$\sigma: \forall X, Y, Z \; (s(X,Y) \wedge s(X,Z) \rightarrow (Y = Z))$. 

This dependency means that whenever two tu- 
ples in relation S agree on the value of the first 
attribute A, they also agree on the value of the 
second attribute B of S. 

On all databases satisfying the dependency $\sigma$, 
the query Q can be equivalently rewritten (see footnote 1) 
using the view V, as follows: 

q(X,a) :- v(X,a). 

The reformulation is optimal on all databases 
satisfying $\sigma$, as the materialized view V precomputes 
an exact answer to Q. □ 

In this paper we enhance the results of [CG00] 
to deal with the additional complexities that 
arise in presence of dependencies. The problem 
we consider is as follows: given a set of queries, a 
set of dependencies, and a storage limit, is it pos- 
sible to efficiently generate reformulations that 
satisfy the storage limit and minimize the total 
costs of evaluating the queries, in the presence 
of the dependencies? We look at this problem 
for conjunctive queries, views, and rewritings on 
finite databases in presence of several types of 
dependencies, including functional and inclusion 
dependencies. Our results are applicable in data 
warehousing and query optimization. Our con- 
tributions are as follows: 

• we give a new algorithm and tighter com- 
plexity results for database reformulation 
in the absence of dependencies, for queries 
without self-joins (Section 2.4); 

• we give complexity results and algorithms 
for database reformulation in presence of 
dependencies, based on the chase technique 
[AHV95] for incorporating dependencies 
into query definitions (Section 3); 

• we introduce an unchase technique for reducing 
the problem of query equivalence under 
dependencies to query equivalence in the 
absence of dependencies, without increasing 
query size (Section 4); 

• we show that we can reduce the complexity 
of database reformulation and cover larger 
classes of dependencies by basing the reformulation 
algorithm on the unchase approach 
(Section 4). 

Footnote 1: We assume set semantics [UV93] for query evaluation. 

After covering related work in the remainder of 
this section, we give basic definitions and a formal 
problem statement in Section 2. We then present 
complexity results and algorithms for database 
reformulation: Section 3 describes an approach 
based on chase, and Section 4 discusses our unchase 
technique. We conclude and discuss future 
work in Section 5. 

Related work 

Studies of dependencies have been motivated by 
the goal of good database schema design; in- 
terestingly, they have also contributed to basic 
research in mathematical logic. The study of 
dependency theory began with the introduction 
of functional dependencies in [Cod72]; inclusion 
dependencies were first identified in [CFP84]. 
The topic of queries defined over databases that 
satisfy dependencies was initiated in [ASU79b, 
ASU79a]. Containment in the presence of inclusion 
dependencies has been examined in [KCV83, 
JK84]. For surveys and references on data dependencies, 
see [FV84, Kan90, AHV95]. 

An important technique named chase grew out 
of the algorithm of [ABU79] for testing loss- 
less joins. The chase can be further extended 
into a semidecision procedure for embedded- 
dependency implication and an exponential de- 
cision procedure for full dependency implication, 
see [BV84b, BV84a]. In its most general form, 
chase is similar to resolution with paramodulation. 
See [Deu02, DLN05] and references therein 
for applications of chase to answering queries 
equivalently using views. 

Conjunctive queries [CM77, ASU79b, 
ASU79a] form a large and well-studied class of 
queries that contains a large proportion of those 
questions one might wish to ask in practice. 
When there are no dependencies to consider, 
or when there are only functional dependen- 
cies, it has been shown that the containment, 
equivalence, and minimization problems are all 
NP-complete [CM77]. These results should not 
be viewed as negative, especially for problems 
concerned with query optimization, since queries 
are typically much smaller than the databases 
on which they are asked, and queries may be 
applied repeatedly over time [JK84]. 

References to view selection can be found 
in [Hal01, CHS02, ACQS]. To the best of our 
knowledge, the results presented here are the 
first results on view selection in presence of de- 
pendencies. 

2 Preliminaries 

In this section we provide definitions and technical 
background for our framework, using in part 
the materials in [Kan90, AHV95]. 

2.1 Basic definitions 

A relational database is a finite collection of 
stored relations. Each relation R is a finite 
set of tuples, where each tuple is a list of val- 
ues of the attributes in the relation schema of 
R. We consider select-project-join SQL queries 
with equality comparisons, a.k.a. safe conjunctive 
queries. A conjunctive query is a rule of the 
form $Q : q(\bar{X}) \leftarrow e_1(\bar{X}_1), \ldots, e_n(\bar{X}_n)$, where 
$e_1, \ldots, e_n$ are names of database relations and 
$\bar{X}, \bar{X}_1, \ldots, \bar{X}_n$ are vectors of variables. A query 
Q has self-joins if at least two different atoms 
$e_i(\bar{X}_i)$, $e_j(\bar{X}_j)$ in the body of Q have the same 
relation name. The variables in $\bar{X}$ are called 
head or distinguished variables of Q, whereas the 
variables in $\bar{X}_i$ are called body variables of Q. A 
query is safe if $\bar{X} \subseteq \bigcup_{i=1}^{n} \bar{X}_i$. 
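
As an illustration of these definitions, the following minimal Python sketch represents a conjunctive query and checks safety and self-joins. The `CQ` class, the `is_var` helper, and the convention that variable names start with an uppercase letter are assumptions made for this example only, not notation from the paper; the two sample queries are Q and V of Example 1.1.

```python
from dataclasses import dataclass
from typing import Tuple

# An atom is (relation_name, argument_tuple).
Atom = Tuple[str, Tuple[str, ...]]


def is_var(term: str) -> bool:
    # Assumed convention: uppercase initial = variable, otherwise constant.
    return term[:1].isupper()


@dataclass(frozen=True)
class CQ:
    head: Tuple[str, ...]
    body: Tuple[Atom, ...]

    def has_self_joins(self) -> bool:
        # Two *different* body atoms share a relation name.
        names = [rel for rel, _ in set(self.body)]
        return len(names) != len(set(names))

    def is_safe(self) -> bool:
        # Every head variable must also occur in the body.
        body_vars = {t for _, args in self.body for t in args if is_var(t)}
        return all(t in body_vars for t in self.head if is_var(t))


# Q and V from Example 1.1.
Q = CQ(("X", "Y"), (("s", ("X", "Y")), ("s", ("X", "a")), ("t", ("Y", "a"))))
V = CQ(("X", "W"), (("s", ("X", "a")), ("t", ("a", "W"))))
```

Under this encoding, `Q.has_self_joins()` holds because `s` occurs twice in the body of Q, while V uses each relation name once.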

2.2 Dependencies and chase 

A dependency over a database schema S is a sentence 
in some logical formalism over S. We consider 
tuple-generating dependencies (tgds) and 
equality-generating dependencies (egds) [BV84b]. 
A tgd is of the form $\forall \bar{x} \, (\phi(\bar{x}) \rightarrow \exists \bar{y} \, \psi(\bar{x}, \bar{y}))$, 
and an egd is of the form $\forall \bar{x} \, (\phi(\bar{x}) \rightarrow (x_i = x_j))$. 
Here, $\bar{x} = x_1, \ldots, x_k$, $\bar{y} = y_1, \ldots, y_m$, and 
each of $x_i$, $x_j$ is an element of $\bar{x}$. In addition, 
we consider consistency constraints of the form 
$\forall \bar{x} \, (\phi(\bar{x}) \rightarrow \mathit{false})$. In this paper, we consider 
conjunctive egds, tgds, and consistency constraints, 
that is, in all the dependencies we consider, 
$\phi(\bar{x})$ is a conjunction of relational atoms, 
and $\psi(\bar{x}, \bar{y})$ (in the tgds) is a single relational 
atom. We refer to conjunctive egds as functional 
dependencies (fds), and distinguish between two 
types of conjunctive tgds: In value-preserving 
tgds, $\bar{y}$ in the right-hand side is empty, and in 
value-generating tgds, $\bar{y}$ contains at least one 
variable name. In many results in this paper, 
we focus on a special case of conjunctive tgds 
called inclusion dependencies (ids), which have 
just one relational atom in the left-hand side. 
Inclusion dependencies may be value preserving 
or value generating. We will use a shorthand notation, 
in which quantifiers are not used where 
clear from context. For ids we will also use 
the notation $r[\bar{x}] \subseteq s[\bar{x}]$, which is equivalent to 
$\forall \bar{x}, \bar{z} \, (r(\bar{x}, \bar{z}) \rightarrow \exists \bar{y} \, s(\bar{x}, \bar{y}))$. 

A set $\Sigma$ of ids is acyclic if there is no sequence 
$r_i[\bar{x}_i] \subseteq s_i[\bar{x}_i]$ ($i \in [1, \ldots, n]$) of ids in $\Sigma$ where 
$r_{i+1} = s_i$ for $i \in [1, \ldots, n-1]$, and 
$r_1 = s_n$. A family $\Sigma$ of dependencies has acyclic 
ids if the set of ids in $\Sigma$ is acyclic [AHV95]. We 
define acyclic tgds as follows: A set $\Sigma$ of tgds is 
acyclic if there is no sequence $\sigma_1, \sigma_2, \ldots, \sigma_n$ of 
tgds of $\Sigma$, 
$\sigma_i : r_{i1}(\bar{x}_{i1}) \wedge \ldots \wedge r_{ik}(\bar{x}_{ik}) \rightarrow s_i(\bar{y}_i)$ ($i \in [1, \ldots, n]$), 
where the left-hand side of $\sigma_{i+1}$ includes 
the relation name for the right-hand side of $\sigma_i$, 
for $i \in [1, \ldots, n-1]$, and the left-hand side of $\sigma_1$ 
includes the relation name for the right-hand side 
of $\sigma_n$. A set $\Sigma$ of n acyclic tgds is strongly acyclic 
if there exists a sequence $\sigma_1, \sigma_2, \ldots, \sigma_n$ of tgds 
of $\Sigma$, such that for $i \in [1, \ldots, n-1]$ and for all 
$k > 0$ such that $i + k \le n$, the right-hand side of 
$\sigma_{i+k}$ does not include any relation name in the 
left-hand side of $\sigma_i$. A family $\Sigma$ of dependencies 
has (strongly) acyclic tgds if the set of tgds in $\Sigma$ 
is (strongly) acyclic. 
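
One way to test the acyclicity condition above is to look for a directed cycle among relation names. The sketch below assumes a hypothetical encoding in which each tgd (or id) is reduced to the relation names on its left-hand side and the single relation name on its right-hand side; the function name `is_acyclic` is ours, and the check covers plain acyclicity only, not strong acyclicity.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed encoding of a tgd: (left-hand-side relation names, right-hand-side name).
Tgd = Tuple[Tuple[str, ...], str]


def is_acyclic(tgds: Iterable[Tgd]) -> bool:
    # Edge l -> r whenever some tgd has l on its left-hand side and r on its
    # right-hand side. A forbidden sequence sigma_1..sigma_n exists exactly
    # when this graph contains a directed cycle.
    graph: Dict[str, Set[str]] = {}
    for lhs, rhs in tgds:
        graph.setdefault(rhs, set())
        for rel in lhs:
            graph.setdefault(rel, set()).add(rhs)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {rel: WHITE for rel in graph}

    def reaches_cycle(rel: str) -> bool:
        # Depth-first search; a GREY successor means a back edge, i.e. a cycle.
        colour[rel] = GREY
        for nxt in graph[rel]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and reaches_cycle(nxt):
                return True
        colour[rel] = BLACK
        return False

    return not any(colour[r] == WHITE and reaches_cycle(r) for r in list(graph))


# ids that only point from lower-numbered to higher-numbered relations
forward_ids = [(("p1",), "p2"), (("p1",), "p3"), (("p2",), "p3")]
```

With `forward_ids` the check succeeds; adding an id from `p3` back to `p1` closes a cycle and the check fails.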

We denote the left-hand side of a dependency 
(or the body of a query) by A. An assignment 
$\gamma$ for A is a mapping of the variables appearing 
in A to constants, and of the constants in A to 
themselves. Assignments are naturally extended 
to tuples and atoms; for instance, for a tuple of 
variables $\bar{s} = (s_1, \ldots, s_k)$ we let $\gamma\bar{s}$ denote the 
tuple $(\gamma(s_1), \ldots, \gamma(s_k))$. Satisfaction of atoms 
by an assignment w.r.t. a database is defined as 
follows: $p_i(\gamma\bar{s})$ is satisfied if the tuple $\gamma\bar{s}$ is in the 
relation that corresponds to the predicate of $p_i$. 
This definition is naturally extended to that of 
satisfaction of conjunctions of atoms. An answer 
to a safe query Q with head $q(\bar{x})$ and body A on 
a database $\mathcal{D}$ is the set of all tuples $\gamma(\bar{x})$ such 
that $\gamma$ is a satisfying assignment for A on $\mathcal{D}$. 

A database $\mathcal{D}$ satisfies a set of dependencies $\Sigma$ 
if, for each dependency $\sigma$ in $\Sigma$ and for all satisfying 
assignments $\gamma$ of the left-hand side of $\sigma$ w.r.t. 
$\mathcal{D}$, $\sigma$ evaluates to true. (For value-generating 
tgds $\sigma$, we additionally require that we can extend 
each $\gamma$ in such a way that the right-hand 
side of $\sigma$ evaluates to true.) For a given set $\Sigma$ 
of dependencies and conjunctive queries $Q_1$ and 
$Q_2$, $Q_1$ is contained in $Q_2$ under $\Sigma$, denoted by 
$Q_1 \sqsubseteq_{\Sigma} Q_2$, if for any database $\mathcal{D}$ that satisfies $\Sigma$, 
the answer to $Q_1$ on $\mathcal{D}$ is a subset of the answer 
to $Q_2$ on $\mathcal{D}$. Two queries are equivalent under 
$\Sigma$ if they are contained in each other under $\Sigma$. 
Query containment and equivalence in the absence 
of dependencies are defined as above for the 
case $\Sigma = \emptyset$ (empty set). 

In this paper we use the following results 
of [CM77] for conjunctive queries. In the absence 
of dependencies, a query $Q_1$ is contained 
in $Q_2$ if and only if there exists a containment 
mapping from $Q_2$ to $Q_1$, that is, a homomorphism 
from the variables of $Q_2$ to the variables 
and constants of $Q_1$, such that (1) each atom in 
the body of $Q_2$ is mapped into some atom in the 
body of $Q_1$, and (2) the head of $Q_2$ is mapped 
into the head of $Q_1$. For a query Q, its minimized 
version is an equivalent query Q' with a 
minimum number of subgoals, which can be obtained 
via repeated applications of containment 
mappings. Two queries are equivalent if and only 
if their minimized versions are isomorphic. 
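
The containment-mapping test can be sketched as a small backtracking search. This is a minimal illustration under assumed conventions (queries as `(head, body)` pairs, variables written with an uppercase initial); the names `containment_mapping` and `contained_in` are hypothetical and the search is exponential in the worst case, in line with the NP-completeness of the problem.

```python
from typing import Dict, Optional, Sequence, Tuple

Atom = Tuple[str, Tuple[str, ...]]
Query = Tuple[Tuple[str, ...], Sequence[Atom]]  # (head arguments, body atoms)


def is_var(term: str) -> bool:
    return term[:1].isupper()  # assumed convention: uppercase = variable


def _unify(src, dst, mapping) -> Optional[Dict[str, str]]:
    # Extend `mapping` so that the argument list `src` is sent onto `dst`.
    if len(src) != len(dst):
        return None
    out = dict(mapping)
    for s, d in zip(src, dst):
        if is_var(s):
            if out.setdefault(s, d) != d:
                return None       # variable already mapped elsewhere
        elif s != d:
            return None           # constants map to themselves only
    return out


def containment_mapping(q2: Query, q1: Query) -> Optional[Dict[str, str]]:
    # Look for a homomorphism from q2 to q1 (head to head, atoms to atoms).
    (head2, body2), (head1, body1) = q2, q1
    start = _unify(head2, head1, {})
    if start is None:
        return None

    def extend(i, mapping):
        if i == len(body2):
            return mapping
        rel, args = body2[i]
        for rel1, args1 in body1:
            if rel1 == rel:
                nxt = _unify(args, args1, mapping)
                if nxt is not None:
                    found = extend(i + 1, nxt)
                    if found is not None:
                        return found
        return None

    return extend(0, start)


def contained_in(q1: Query, q2: Query) -> bool:
    # q1 is contained in q2 iff there is a containment mapping from q2 to q1.
    return containment_mapping(q2, q1) is not None


# Q of Example 1.1 and the expansion of the rewriting q(X,a) :- v(X,a).
Q = (("X", "Y"), [("s", ("X", "Y")), ("s", ("X", "a")), ("t", ("Y", "a"))])
R_EXP = (("X", "a"), [("s", ("X", "a")), ("t", ("a", "a"))])
```

For the queries of Example 1.1, the expansion is contained in Q, but Q is not contained in the expansion when no dependencies are assumed, since the constant `a` in the head of the expansion cannot be mapped to the variable `Y`.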

It is easy to show the following: 

Proposition 2.1 Given a database schema S, 
queries $Q_1$ and $Q_2$ defined on S, and a set $\Sigma$ of 
dependencies on S, if $Q_1$ is contained in $Q_2$ in 
the absence of dependencies, $Q_1 \sqsubseteq Q_2$, then $Q_1$ 
is contained in $Q_2$ under $\Sigma$, $Q_1 \sqsubseteq_{\Sigma} Q_2$. □ 

The chase is a process that, given dependencies 
$\Sigma$, transforms a query Q into a query Q' such 
that $Q \equiv_{\Sigma} Q'$. A chase sequence of a conjunctive 
query Q : q :- A by a set of dependencies $\Sigma$ 
is a (possibly infinite) sequence of conjunctive 
queries $(q_0, A_0), (q_1, A_1), \ldots, (q_i, A_i), \ldots$, where 
$q_0 = q$ and $A_0 = A$ and for each $i \ge 0$, the query 
$(q_{i+1}, A_{i+1})$ is the result of applying some dependency 
in $\Sigma$ to the query $(q_i, A_i)$. We can apply 
a dependency to a query if there is a satisfying 
assignment $\gamma$ of the left-hand side of the dependency 
w.r.t. the body of the query. For fds, the 
chase rule is to consistently rename query variables 
according to the equality in the right-hand 
side of the fd. For ids, the chase rule [JK84] 
adds to a partial chase result $(q_i, A_i)$ a subgoal 
that matches the right-hand side $p(\bar{x})$ of the id, 
provided no existing subgoal in $(q_i, A_i)$ matches 
$p(\bar{x})$. This rule is extended to tgds in a natural 
way. The chase sequence is terminal if (1) it is finite, 
and (2) no dependency in $\Sigma$ can be applied 
to the last element in the sequence. The result 
of a terminal chase sequence is its last element 
$(q_n, A_n)$, written in query form as Q' : $q_n$ :- $A_n$. 

Definition 2.1 (Chase) For a query Q and a 
set of dependencies $\Sigma$, the chase of Q by $\Sigma$, denoted 
$chase_{\Sigma}(Q)$, is the result of any terminal 
chasing sequence of Q by $\Sigma$. □ 

Given a query Q and dependencies $\Sigma$, we compute 
$chase_{\Sigma}(Q)$ by picking the dependencies in 
$\Sigma$ in some arbitrary order and applying them 
to Q. Importantly, the chase is determined by 
the semantics, rather than the syntax, of the 
dependencies in $\Sigma$. Let $\Sigma$ and $\Sigma'$ be two sets 
of dependencies over schema S. If $\Sigma \equiv \Sigma'$ (see footnote 2), 
then $chase_{\Sigma}(Q)$ and $chase_{\Sigma'}(Q)$ coincide for any 
query Q. 
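
The chase rule for fds described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that each fd is encoded as `(relation, key positions, dependent position)`, that variables have an uppercase initial, and the function name `chase_fds` is ours. The sample input is query Q of Example 1.1 with the fd stating that the first attribute of S determines the second.

```python
from typing import Optional, Sequence, Tuple

Atom = Tuple[str, Tuple[str, ...]]
# Assumed fd encoding: (relation, key positions, dependent position).
Fd = Tuple[str, Tuple[int, ...], int]


def is_var(term: str) -> bool:
    return term[:1].isupper()


def _violation(body: Sequence[Atom], fds: Sequence[Fd]) -> Optional[Tuple[str, str]]:
    # Find two atoms that agree on the key but differ on the dependent position.
    for rel, key, dep in fds:
        rows = [args for name, args in body if name == rel]
        for a in rows:
            for b in rows:
                if a[dep] != b[dep] and all(a[p] == b[p] for p in key):
                    return a[dep], b[dep]
    return None


def chase_fds(head, body, fds):
    body = list(dict.fromkeys(body))          # drop duplicate subgoals
    while True:
        pair = _violation(body, fds)
        if pair is None:                      # terminal: no fd applies
            return tuple(head), tuple(body)
        old, new = pair
        if not is_var(old):                   # prefer replacing a variable
            old, new = new, old
        if not is_var(old):
            raise ValueError("two distinct constants equated by an fd")
        # Consistently rename `old` to `new` in the head and the body.
        head = tuple(new if t == old else t for t in head)
        body = list(dict.fromkeys(
            (name, tuple(new if t == old else t for t in args))
            for name, args in body))


head, body = chase_fds(
    ("X", "Y"),
    [("s", ("X", "Y")), ("s", ("X", "a")), ("t", ("Y", "a"))],
    [("s", (0,), 1)],
)
```

On this input the sketch equates Y with the constant a, giving the head `("X", "a")` and the body `s(X,a), t(a,a)`, which is the body of view V of Example 1.1 with W bound to a.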

The following result has been shown for sets of 
functional dependencies in [AHV95]; we have extended 
it to sets of any dependencies considered 
in this paper. 

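Theorem 2.1 Given conjunctive queries $Q_1$, 
$Q_2$ and a set $\Sigma$ of fds, conjunctive consistency 
constraints, and conjunctive tgds: 

1. $Q_1 \sqsubseteq_{\Sigma} Q_2$ iff $chase_{\Sigma}(Q_1) \sqsubseteq chase_{\Sigma}(Q_2)$ in 
the absence of any constraints. 

2. $Q_1 \equiv_{\Sigma} Q_2$ iff $chase_{\Sigma}(Q_1) \equiv chase_{\Sigma}(Q_2)$ in 
the absence of any constraints. □ 

Footnote 2: $\Sigma \equiv \Sigma'$ if $\Sigma \models \Sigma'$ and $\Sigma' \models \Sigma$. 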

2.3 Views and database reformulation 

A view refers to a named query. A view is said 
to be materialized if its answer is stored in the 
database. Let $\mathcal{V}$ be a set of views defined on a 
database schema S, and $\mathcal{D}$ be a database with 
schema S; by $\mathcal{D}_{\mathcal{V}}$ we denote the database obtained 
by computing all the view relations in $\mathcal{V}$ 
on $\mathcal{D}$. Let Q be a query defined on S. A query R is a 
rewriting of Q using $\mathcal{V}$ if all atoms in the body 
of R are view predicates defined in $\mathcal{V}$. 

The expansion $R^{exp}$ of a rewriting R of a query 
Q on a set of views $\mathcal{V}$ is obtained from R by 
replacing all the view atoms in the body of R by 
their definitions in terms of the base relations. A 
rewriting R of a query Q on a set of views $\mathcal{V}$ is 
an equivalent rewriting of Q under $\Sigma$ if for every 
database $\mathcal{D}$ that satisfies $\Sigma$, $Q(\mathcal{D}) = R(\mathcal{D}_{\mathcal{V}})$. 

We consider the following database-reformulation 
problem: Given a set of conjunctive 
queries $\mathcal{Q}$ on stored relations, a fixed 
database instance $\mathcal{D}$ that satisfies a set of 
dependencies $\Sigma$, and a storage limit L, we want 
to find and precompute offline a set of views 
on the stored relations. A set of views $\mathcal{V}$ is 
admissible for $(\mathcal{Q}, \mathcal{D}, \Sigma, L)$ if (1) $\mathcal{V}$ provides an 
equivalent rewriting for each query in $\mathcal{Q}$ under 
$\Sigma$, and (2) the total size of the relations for $\mathcal{V}$ 
on $\mathcal{D}$ does not exceed the storage limit L. (The 
size of a relation is the number of bytes used 
to store the relation.) Among such admissible 
sets of views, our goal is to find a beneficial 
(or optimal) viewset, that is, a set of views 
whose use in rewritings of the queries in $\mathcal{Q}$ 
reduces (minimizes) the sum of evaluation costs 
of these queries on the database $\mathcal{D}$ satisfying 
the dependencies $\Sigma$. For query-evaluation costs, 
we consider size-monotonic cost models, where 
(1) query costs are computed using the sizes 
of the contributing relations, and (2) whenever 
a relation in a query expression is replaced by 
another relation of at most the same size, the 
cost of evaluating the new expression is at most 
the cost of evaluating the original expression. 
All the common cost models in the literature 
are size-monotonic. 

Definition 2.2 (Database reformulation) 
For a problem input $\mathcal{I} = (\mathcal{Q}, \mathcal{D}, \Sigma, L)$, a beneficial 
(optimal) viewset is a set of views $\mathcal{V}$ 
defined on S, such that: (1) $\mathcal{V}$ is an admissible 
viewset for $\mathcal{I}$, and (2) $\mathcal{V}$ reduces (minimizes) the 
total cost of evaluating the queries in $\mathcal{Q}$ on the 
database $\mathcal{D}_{\mathcal{V}}$. □ 

We consider this problem in relational 
databases for conjunctive queries, views, and 
rewritings. We assume that filtering views are 
not used in query rewritings (see footnote 3). In some results 
we additionally assume that input queries do 
not have self-joins. We use these simplifying assumptions 
to do an initial study of the structure 
of the database-reformulation problem under dependencies. 
It is known that when these assumptions 
do not hold, the problem has a triply exponential 
upper bound and a singly exponential 
lower bound even in the absence of dependencies 
[CHS02]. The database-reformulation problem 
is in NP in the absence of dependencies when 
input queries do not have self-joins and when filtering 
views are not used [ACGP05]. 

2.4 The cgalg algorithm [CG00] 

We now outline an algorithm for generating beneficial 
reformulations for the case where the set 
of dependencies $\Sigma$ is empty and $\mathcal{Q}$ comprises a 
single query Q [CG00]. For each beneficial reformulation 
(viewset) $\mathcal{V}$ for a problem input $\mathcal{I}$, this 
algorithm generates at least one beneficial reformulation 
(viewset) $\mathcal{V}'$ that reduces the costs of 
the input query workload at least as much as 
$\mathcal{V}$ and satisfies the same storage limit. We say 
that the algorithm produces the best beneficial 
database reformulations. 

Procedure cgalg. 

Input: query Q, database $\mathcal{D}$, storage limit L. 
Output: $R_{opt}$, an optimal equivalent rewriting of Q on $\mathcal{D}$. 

Begin: 
    minimize Q to obtain a query Q'; 
    set $R_{opt}$ to Q'; 
    set the cost $C_{opt}$ of $R_{opt}$ to C(Q'); 
    find all views $\mathcal{V}$ whose body is a subset 
        of subgoals of Q'; 
    for each subset $\mathcal{W}$ of $\mathcal{V}$ such that 
        $\sum_{w \in \mathcal{W}} size(w, \mathcal{D}) \le L$ do: 
    begin: 
        find a rewriting R of Q' using $\mathcal{W}$; 
        construct the expansion $R^{exp}$ of R; 
        if there exists a containment mapping 
            from Q' to $R^{exp}$ then: 
            if the cost $C(R, \mathcal{D}, O)$ of answering Q' 
                on $\mathcal{D}$ using R is less than $C_{opt}$ 
            then begin: 
                $R_{opt}$ := R; 
                $C_{opt}$ := $C(R, \mathcal{D}, O)$; 
            end; 
    end; 
    return $R_{opt}$. 
End. 

Footnote 3: In an equivalent rewriting R of a query Q, a view V 
is a filtering view if the result of removing the literal for 
V from R is still an equivalent rewriting of Q. 

A view-size oracle O instantaneously gives the 
size of any relation defined on the database $\mathcal{D}$; 
we assume that for a rewriting R in terms of 
views and for a fixed size-monotonic cost model 
for query evaluation, the time required to obtain 
the cost $C(R, \mathcal{D}, O)$ of evaluating R in terms of 
the relations for the views on $\mathcal{D}$ is negligible when 
using the oracle O. In practice, the view sizes 
and costs of answering Q on $\mathcal{D}$ using R can be 
estimated via standard formulas used in query 
optimizers in database-management systems. It 
is easy to see how the cgalg algorithm can be 
extended to problem inputs with non-singleton 
query workloads (see footnote 4). 
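
The control flow of cgalg (enumerate viewsets within the storage limit, keep the cheapest equivalent rewriting) can be sketched as below. This is only a skeleton under stated assumptions: the callbacks `size`, `rewrite`, `expand`, `has_mapping`, and `cost` stand in for the view-size oracle, the rewriting step, the expansion, the containment-mapping test, and the cost model, and every `toy_*` function is a deliberately trivial placeholder introduced for this example, not part of the algorithm in [CG00].

```python
from itertools import chain, combinations


def cgalg_sketch(q_min, views, size, limit, rewrite, expand, has_mapping, cost):
    # q_min is assumed to be already minimized; views are the candidate views.
    best, best_cost = q_min, cost(q_min)
    subsets = chain.from_iterable(
        combinations(views, k) for k in range(1, len(views) + 1))
    for w in subsets:                       # up to 2^|views| viewsets
        if sum(size(v) for v in w) > limit:
            continue                        # violates the storage limit
        r = rewrite(q_min, w)
        if r is None:
            continue                        # no rewriting over this viewset
        if has_mapping(q_min, expand(r)) and cost(r) < best_cost:
            best, best_cost = r, cost(r)
    return best, best_cost


# Toy stand-ins: a query is a frozenset of subgoal labels, a view is a
# frozenset of the subgoals it covers, a rewriting is a tuple of views.
SIZES = {frozenset({"a"}): 5, frozenset({"b"}): 5, frozenset({"a", "b"}): 3}
QUERY = frozenset({"a", "b"})


def toy_rewrite(q, w):
    covers = frozenset().union(*w) == q
    no_overlap = sum(len(v) for v in w) == len(q)
    return w if covers and no_overlap else None


def toy_expand(r):
    return frozenset().union(*r)


def toy_has_mapping(q, expansion):
    return q == expansion


def toy_cost(x):
    # Flat cost for the original query, summed view sizes for a rewriting.
    return 20 if isinstance(x, frozenset) else sum(SIZES[v] for v in x)


best, best_cost = cgalg_sketch(
    QUERY, list(SIZES), SIZES.__getitem__, 10,
    toy_rewrite, toy_expand, toy_has_mapping, toy_cost)
```

In the toy instance the single view covering both subgoals wins with cost 3; the loop over all subsets is what makes the running time exponential in the number of candidate views.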

Proposition 2.2 [CG00, ACGP05] Given $\Sigma = \emptyset$ 
and provided that all view atoms in all rewritings 
have different relation names and that filtering 
views are not used in query rewritings, the 
algorithm cgalg is sound for problem inputs with 
workloads of arbitrary conjunctive queries and is 
complete for problem inputs with workloads of 
conjunctive queries without self-joins. The decision 
version of the problem of finding optimal 
reformulations is NP-complete. □ 

Footnote 4: When queries have no self-joins and each view in $\mathcal{V}$ is 
used exactly once in the rewriting of exactly one query in 
$\mathcal{Q}$ [ACGP05], cgalg can look for views for each workload 
query separately even when $\Sigma$ is not empty. 

In general, the algorithm is not complete (i.e., is 
not guaranteed to produce an optimal reformulation) 
because some optimal rewritings may use 
self-joins of view literals [CHS02]. 

Proposition 2.3 Under the assumptions of 
Proposition 2.2 and assuming that a view-size 
oracle O and a size-monotonic cost model for 
query evaluation are given, the runtime of cgalg 
is $O(2^m)$, where m is the total number of subgoals 
of the queries in the input workload $\mathcal{Q}$. □ 

Intuitively, under the assumptions of Proposition 2.2, 
cgalg will generate all beneficial reformulations 
if it generates only viewsets that have 
up to m views [ACGP05]. Note that the step of 
generating a rewriting given a subset $\mathcal{W}$ of the 
set $\mathcal{V}$ of views takes constant time in the size of 
the subset $\mathcal{W}$ [ALU01]. 

3 Dependencies and Chase 

In this section and in Section 4 we consider the 
database-reformulation problem for workloads of 
conjunctive queries under a nonempty set of dependencies 
$\Sigma$. In this section our focus is on 
using chase to extend the cgalg algorithm (Section 2.4) 
to database reformulation in presence 
of dependencies. 

We first observe that the straightforward approach 
to finding all useful views and rewritings 
does not really work. Given a query Q and a set 
of dependencies $\Sigma$, we can use Theorem 2.1 to reduce 
the problem of finding rewriting expansions 
that are equivalent to Q under $\Sigma$ to the problem 
of finding rewriting expansions whose terminal 
chase result (under $\Sigma$) is equivalent, in the absence 
of $\Sigma$, to the terminal chase result $Q_c$ of Q 
under $\Sigma$. Even if $Q_c$ is unique and finite, the 
number of queries that are equivalent to $Q_c$ is 
infinite [CM77], and the number of all beneficial 
views and rewritings can be infinite [CG00]. In 
this section we use chase to extend the approach 
of [CG00] of generating the best (rather than all) 
beneficial viewsets using the cgalg algorithm. 



3.1 Consistency constraints 

We first show that consistency constraints do 
not generate new views. 

Theorem 3.1 Let $\mathcal{I}$ be a problem input where 
all dependencies in $\Sigma$ are consistency constraints. 
Then an optimal set of views $\mathcal{V}$ for $\mathcal{I}$ 
can be found by finding an optimal set of views 
for the problem input that is obtained by removing 
all dependencies from $\mathcal{I}$. □ 

A corollary of this result is that if at least 
one consistency constraint is combined with any 
number of fds and tgds, then the database-reformulation 
output is the same as for a problem 
input where all the consistency constraints 
are removed. 

3.2 Functional dependencies 

As we saw in Example 1.1 in Section 1, unlike 
consistency constraints, fds can generate new 
beneficial reformulations. 

Lemma 3.1 [AHV95] Let $\Sigma$ be a set of fds; 
for any query Q, let Q' = $chase_{\Sigma}(Q)$. Then (1) 
Q' is unique up to variable renamings, and (2) 
the size of the minimized version of Q' does not 
exceed the size of the minimized version of Q. □ 

Theorem 3.2 Algorithm cgalg($\{chase_{\Sigma}(Q)\}$, 
$\mathcal{D}$, L) produces an optimal reformulation of a 
problem input $\mathcal{I}$ where $\mathcal{Q} = \{Q\}$ and where 
all dependencies in $\Sigma$ are fds, provided that all 
queries in the workload $\{chase_{\Sigma}(Q)\}$ have no 
self-joins. □ 

Note that to produce an optimal reformulation, 
we only need to consider the terminal chase result 
of each query in the workload $\mathcal{Q}$. The complexity 
of cgalg here does not exceed the complexity 
of cgalg for the same problem input in 
the absence of dependencies; note that the original 
queries may have self-joins (see Example 1.1). 

3.3 Conjunctive tgds 

We now consider problem inputs whose dependency 
sets $\Sigma$ contain acyclic sets of conjunctive 
tgds. We first consider the case where all tgds 
are ids. 

Proposition 3.1 [AHV95] Let Q be a query 
and $\Sigma$ a set of fds and acyclic ids. Then each 
chasing sequence of Q by $\Sigma$ terminates after an 
exponentially bounded number of steps. □ 

Proposition 3.2 Let $\Sigma$ be a set of fds $\Sigma[F]$ and 
acyclic ids $\Sigma[I]$, $\Sigma[F] \cup \Sigma[I] = \Sigma$. Then 
for all conjunctive queries Q, 
$chase_{\Sigma}(Q) = chase_{\Sigma[I]}(chase_{\Sigma[F]}(Q))$. □ 

This result extends the result of [JK84] for a special 
class of sets of fds and "key-based" ids; to 
obtain the extension, we use the observation that 
the chase rule for ids in [JK84] (which we also 
use) does not add to the partial chase result $Q_{c,p}$ 
the right-hand side of a qualifying id if a matching 
subgoal is already in $Q_{c,p}$. In extending the 
result to acyclic tgds, the subtlety is that (part 
of) the left-hand side of a tgd can match the 
left-hand side of an fd in the same set of dependencies, 
which would cause Proposition 3.2 
to be violated. (For instance, $\Sigma$ can include a 
tgd $s(X,Y) \wedge s(X,Z) \rightarrow p(X,Z)$ and an fd 
$s(X,Y) \wedge s(X,Z) \rightarrow Y = Z$.) We obtain the 
result of Proposition 3.2 for sets of dependencies 
$\Sigma$ that have been preprocessed, by applying each 
fd in $\Sigma$ to the left-hand side of each tgd in $\Sigma$. 

Theorem 3.3 Algorithm cgalg($\{chase_{\Sigma}(Q)\}$, 
$\mathcal{D}$, L) is sound for problem inputs $\mathcal{I}$ where $\mathcal{Q} = \{Q\}$ 
and where $\Sigma$ is a set of fds and acyclic 
tgds. The algorithm is complete for such inputs 
if queries $chase_{\Sigma}(Q)$ have no self-joins. □ 

4 Reducing the Complexity by 
Unchase 

In Section 3 we saw that we can obtain the best 
beneficial reformulations for a workload of conjunctive 
queries in presence of consistency constraints, 
fds, and acyclic tgds, either separately 
or in combination, by using the cgalg algorithm 
on the terminal chase results of the workload 
queries. At the same time, the restrictions on 
this approach are rather strong. First, the terminal 
chase result of each query cannot have self-joins 
if we want to obtain optimal reformulations. 
Second, as shown in Section 2.4, the complexity 
of cgalg is exponential in the size of the queries 
to which cgalg is applied, that is, of the terminal 
chase results of the workload queries. 

We now give an example where the terminal 
chase result of a query under acyclic ids (1) has 
self-joins, and (2) is of size exponential in the 
size of the query. Thus, the cgalg approach of 
Section 3 is not guaranteed to produce optimal 
reformulations in this case, and the cost of using 
the approach to produce some beneficial reformulations 
would be prohibitive even for simple 
queries. However, in this section we give a 
modified cgalg approach that is applicable to 
the problem input of this example and to other 
cases, including problem inputs where the terminal 
chase results of the input queries under the 
input dependencies are infinite in size. 

EXAMPLE 4.1 On a database schema S = 
$\{P_1(A_1,B_1), P_2(A_2,B_2), \ldots, P_m(A_m,B_m)\}$, 
consider a query Q with a single subgoal $p_1$: 
$q(X,Y)$ :- $p_1(X,Y)$. 

Suppose the database schema S satisfies a set $\Sigma$ 
of acyclic ids of the following form: 
$\sigma^{(1)}_{i,j}: p_i(X,Y) \rightarrow p_j(Z,X)$ 
$\sigma^{(2)}_{i,j}: p_i(X,Y) \rightarrow p_j(Y,W)$ 

$\Sigma$ has one id $\sigma^{(1)}_{i,j}$ and one id $\sigma^{(2)}_{i,j}$ for each pair 
(i,j), where $i \in \{1, \ldots, m-1\}$ and $j \in \{i+1, \ldots, m\}$ 
(i < j in each pair). Thus, the number 
of dependencies in $\Sigma$ is quadratic in m. 

We show one partial chase result of the query 
Q under dependencies $\Sigma$, for m > 2: 
$q'(X,Y)$ :- $p_1(X,Y)$, $p_2(Z_1,X)$, $p_2(Y,Z_2)$. 
This query Q' is the result of applying to Q dependencies 
$\sigma^{(1)}_{1,2}$ and $\sigma^{(2)}_{1,2}$. 

For the terminal result $Q_c$ of chasing the query 
Q under the ids $\Sigma$, we can show that the size of 
$Q_c$ is exponential in the size of Q and $\Sigma$. □ 
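
To make the growth in this example concrete, the following sketch chases the single-subgoal query with the two families of ids and counts the resulting subgoals per relation. The encoding is an assumption made for illustration: an atom $p_i(a,b)$ is the triple `(i, a, b)`, fresh existential variables are named `N0, N1, ...`, and the function names are ours. The blocking test follows the id chase rule of Section 2.2: a subgoal is added only if no existing subgoal already matches the right-hand side.

```python
from collections import Counter, deque
from itertools import count


def chase_example_ids(m):
    # Chase q(X,Y) :- p_1(X,Y) with the ids of the example, for relations p_1..p_m.
    fresh = (f"N{i}" for i in count())
    atoms = {(1, "X", "Y")}
    queue = deque(atoms)
    while queue:
        i, a, b = queue.popleft()
        for j in range(i + 1, m + 1):
            # First family: p_i(a,b) -> exists Z. p_j(Z, a)
            if not any(r == j and y == a for r, _, y in atoms):
                new = (j, next(fresh), a)
                atoms.add(new)
                queue.append(new)
            # Second family: p_i(a,b) -> exists W. p_j(b, W)
            if not any(r == j and x == b for r, x, _ in atoms):
                new = (j, b, next(fresh))
                atoms.add(new)
                queue.append(new)
    return atoms


def sizes_per_relation(m):
    counts = Counter(r for r, _, _ in chase_example_ids(m))
    return [counts[j] for j in range(1, m + 1)]
```

For m = 4 this sketch yields 1, 2, 6, and 14 subgoals over $p_1, \ldots, p_4$ respectively, so each additional relation more than doubles the number of new subgoals, and every relation beyond $p_1$ appears in several subgoals (self-joins).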

For the problem input in this example, the 
cost of using cgalg of Section 3 is doubly exponential 
in the size of the query Q, and the 
problem of finding beneficial reformulations has 
an exponential-size lower bound, just because we 
need to output views that cover all the subgoals 
of this exponential-size terminal chase result. 
Thus, the terminal chase result of a query under 
acyclic ids can have an exponential number 
of views even for (1) nonfiltering views only, and 
(2) no self-joins in input queries (cf. [ACGP05]). 

4.1 Unchase for ids and tgds 

The idea we outline in this section is to apply our 
reformulation algorithm to those versions of the 
input queries that have all "derived" subgoals
removed. Thus, our approach is to (1) apply 
"unchase" to all the input queries under the in- 
put dependencies, and then to (2) apply cgalg 
to the results of the unchase. 

We first define unchase for sets of ids only:
Given a finite-size query Q and an id $\sigma$, an
unchase step on $(Q, \sigma)$ is to remove from Q a
subgoal s that is the image, under some
homomorphism $\mu$, of the right-hand side r of $\sigma$,
provided that two conditions are satisfied. First, the
homomorphism $\mu$ can be extended to map the
left-hand side of $\sigma$ into some subgoal of Q other
than s. Second, for each free argument Y of r, $\mu(Y)$ in
s (1) is a variable rather than a constant, (2) is a
nondistinguished variable of Q, and (3) does not
occur in any subgoal of Q except s. For instance,
if we apply the id
$\sigma^{(1)}_{1,2}: p_1(X,Y) \rightarrow p_2(Z,X)$ to
query Q' in Example 4.1, we will obtain a query
q''(X,Y) :- p_1(X,Y), p_2(Y,Z_2).
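
The unchase step just defined can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation. A subgoal is a pair (predicate, arguments), variables are strings starting with an uppercase letter, an id is a pair (left-hand side, right-hand side) whose arguments are all variables, and "free arguments" of the right-hand side are read as the variables that do not occur in the left-hand side. The names is_var, match, unchase_step, and unchase are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def match(pattern, target, mu):
    """Extend mu so that the id atom `pattern` maps onto `target`, else None."""
    (p, pargs), (q, qargs) = pattern, target
    if p != q or len(pargs) != len(qargs):
        return None
    mu = dict(mu)
    for a, b in zip(pargs, qargs):
        if mu.setdefault(a, b) != b:      # a is already mapped elsewhere
            return None
    return mu


def unchase_step(head, body, ident):
    """Return body minus one removable subgoal, or None if no step applies."""
    lhs, rhs = ident
    free = set(rhs[1]) - set(lhs[1])      # variables of rhs not in lhs
    for k, s in enumerate(body):
        mu = match(rhs, s, {})            # s must be an image of rhs
        if mu is None:
            continue
        others = body[:k] + body[k + 1:]
        # Condition 1: mu extends to map lhs into a subgoal other than s.
        if not any(match(lhs, t, mu) is not None for t in others):
            continue
        # Condition 2: images of free arguments are nondistinguished
        # variables that occur in no subgoal except s.
        elsewhere = {a for _, args in others for a in args}
        if all(is_var(mu[y]) and mu[y] not in head and mu[y] not in elsewhere
               for y in free):
            return others
    return None


def unchase(head, body, ids):
    """Apply unchase steps until none applies (each step removes a subgoal)."""
    body = list(body)
    progress = True
    while progress:
        progress = False
        for ident in ids:
            result = unchase_step(head, body, ident)
            if result is not None:
                body, progress = result, True
    return body
```

On the query Q' of Example 4.1, one step with the id p_1(X,Y) -> p_2(Z,X) removes the subgoal p_2(Z_1,X); applying both ids of the pair (1,2) until no step applies leaves the single subgoal p_1(X,Y).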

For a query Q and for a set of ids $\Sigma$, we
denote by $Q_{u,\Sigma}$ the terminal unchase result of Q
under $\Sigma$. Note that unchase under ids terminates
in finite time, as each successful unchase
step removes a subgoal from the current partial
unchase result. We obtain the following
uniqueness result for unchase under ids:
Lemma 4.1 For a conjunctive query Q, for a
set of dependencies $\Sigma$ that has ids only, and for
any finite-size (either partial or terminal) chase
result Q' of Q under $\Sigma$, $Q_{u,\Sigma}$ is equivalent to
$Q'_{u,\Sigma}$ in the absence of dependencies. □
It follows [CM77] that the result of minimizing
$Q_{u,\Sigma}$ is isomorphic to the result of minimizing
$Q'_{u,\Sigma}$. Note that in Lemma 4.1 we do not
require id acyclicity, and thus the result applies
to problem inputs with sets of cyclic ids, such as
$\{p(X,Y) \rightarrow p(Y,Z)\}$. We have also extended the
result of Lemma 4.1 to sets of strongly acyclic
tgds; the unchase rule for tgds is analogous to
that for ids. (We require strong acyclicity in the
proof to ensure that all tgds can be applied in
the unchase process.)

4.2 Unchase in presence of fds 

Using Lemma 4.1 we can show that cgalg can
be applied to the problem input of Example 4.1
to obtain an optimal reformulation from just the
terminal unchase result (which is Q itself) of the
query Q under the set $\Sigma$ of ids. Moreover, we can
extend the unchase/cgalg approach to combinations
of ids (or of strongly acyclic tgds) with fds.
We first note that if we try to unchase a query
using fds only, the unchase process will not
terminate in finite time:

EXAMPLE 4.2 For a query
q(X,Y) :- p(X,Y).

and for a set of dependencies $\Sigma$ with a single fd,
$\Sigma = \{\sigma: p(X,Y) \wedge p(X,Z) \rightarrow Y = Z\}$, an
unchase step "add to Q a subgoal p with a fresh
variable for the second argument" can be applied
infinitely many times. This query Q' is a partial
unchase result after two steps:
q'(X,Y) :- p(X,Y), p(X,Z_1), p(X,Z_2). □

At the same time, we can guarantee unchase
termination and "good" properties of the cgalg
approach if we incorporate fds into unchase as
follows: (1) An unchase step for fds is the same as
a "regular" chase step on fds, see Section 2.2. (2)
A query is unchased in presence of fds combined
with ids (tgds) by applying all the ids (tgds)
before all the fds. The complexity of unchase under
ids only is $m^3 |\Sigma|$, where m is the total number of
subgoals in the query workload; the complexity
of unchase under ids and fds is $m^4 |\Sigma|$.
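
Item (1) above reuses the ordinary chase step for fds, which equates terms rather than adding subgoals. A minimal Python sketch of that step follows, assuming that an fd is encoded as a triple (predicate, key positions, dependent position), that variables are strings starting with an uppercase letter, and that two distinct constants forced to be equal make the query unsatisfiable. The names pick and chase_fds and this encoding are illustrative assumptions.

```python
from itertools import combinations


def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def pick(a, b, head):
    """Choose the surviving term: constants first, then head variables."""
    def rank(t):
        return 0 if not is_var(t) else (1 if t in head else 2)
    keep, drop = sorted((a, b), key=rank)
    if not is_var(drop):                  # two distinct constants clash
        return None, None
    return keep, drop


def chase_fds(head, body, fds):
    """Chase (head, body) with fds; return None if the query is unsatisfiable."""
    head, body = tuple(head), list(dict.fromkeys(body))
    changed = True
    while changed:
        changed = False
        for pred, key, dep in fds:
            atoms = [args for p, args in body if p == pred]
            for s, t in combinations(atoms, 2):
                if all(s[i] == t[i] for i in key) and s[dep] != t[dep]:
                    keep, drop = pick(s[dep], t[dep], head)
                    if drop is None:
                        return None

                    def sub(x, keep=keep, drop=drop):
                        return keep if x == drop else x

                    head = tuple(sub(x) for x in head)
                    # Substitute everywhere and drop duplicate subgoals.
                    body = list(dict.fromkeys(
                        (p, tuple(sub(x) for x in args)) for p, args in body))
                    changed = True
                    break
            if changed:
                break                     # restart on the updated body
    return head, body
```

Applied to the query Q' of Example 4.2 with the fd that makes the first argument of p determine the second, the sketch collapses the three subgoals back to p(X,Y). Unlike the naive "add a subgoal" step, it cannot run forever, because every step eliminates a variable. In a combined setting this function would be called only after all id (tgd) unchase steps, as item (2) prescribes.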

Proposition 4.1 For a conjunctive query Q,
for a set of dependencies $\Sigma$ that has fds
either alone or in combination with ids or strongly
acyclic tgds, and for any finite-size (either
partial or terminal) chase result Q' of Q under $\Sigma$,
$Q_{u,\Sigma}$ is equivalent to $Q'_{u,\Sigma}$ in the absence of
dependencies. □

To prove this result, we apply and extend the
id/fd separability result of [JK84] that says that
$chase_{\Sigma[I+F]}(Q) = chase_{\Sigma[I]}(chase_{\Sigma[F]}(Q))$ (for
the notation, see Proposition 3.2).

The following result is obtained using Proposition 4.1:
Theorem 4.1 For any two conjunctive queries
$Q_1$ and $Q_2$ and for a set of dependencies $\Sigma$ that
satisfies the conditions of Proposition 4.1,
$Q_1 \equiv_{\Sigma} Q_2$ if and only if $Q_{1,u,\Sigma}$ is
equivalent to $Q_{2,u,\Sigma}$ in the absence of
dependencies. □
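
Theorem 4.1 leaves only a dependency-free equivalence test to perform on the terminal unchase results. That test is the classical containment-mapping check [CM77], which can be sketched in Python as follows. The sketch is a brute-force illustration under stated assumptions: a query is a pair (head, body), subgoals are pairs (predicate, arguments), variables start with an uppercase letter, and the inputs are assumed to be already unchased. The names contained_in and equivalent are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def extend(mu, pattern, target):
    """Extend mapping mu so that the terms of pattern map onto target."""
    mu = dict(mu)
    for a, b in zip(pattern, target):
        if is_var(a):
            if mu.setdefault(a, b) != b:
                return None
        elif a != b:                      # constants must map to themselves
            return None
    return mu


def contained_in(q1, q2):
    """True if q1 is contained in q2 (containment mapping from q2 to q1)."""
    (h1, b1), (h2, b2) = q1, q2
    if len(h1) != len(h2):
        return False
    start = extend({}, h2, h1)            # heads must correspond
    if start is None:
        return False

    def search(mu, remaining):
        if not remaining:
            return True
        (p, args), rest = remaining[0], remaining[1:]
        for q, targs in b1:
            if p == q and len(args) == len(targs):
                nxt = extend(mu, args, targs)
                if nxt is not None and search(nxt, rest):
                    return True
        return False

    return search(start, list(b2))


def equivalent(q1, q2):
    """Equivalence in the absence of dependencies: containment both ways."""
    return contained_in(q1, q2) and contained_in(q2, q1)
```

For instance, q(X) :- p(X,Y), p(X,Z) and q(X) :- p(X,W) are reported equivalent, whereas q(X) :- p(X,X) is contained in q(X) :- p(X,W) but not equivalent to it. The backtracking search is exponential in the worst case, which is consistent with the NP-completeness of conjunctive-query containment.
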

To obtain beneficial reformulations for a
problem input $\mathcal{I}$, we apply cgalg on the terminal
results of unchasing the workload queries in $\mathcal{I}$
under the set of dependencies in $\mathcal{I}$.

Theorem 4.2 cgalg($\{Q_{u,\Sigma}\}$, $\mathcal{D}$, $L$) is sound for
problem inputs $\mathcal{I}$ where the workload $\mathcal{Q} = \{Q\}$
has conjunctive queries only and such that $\Sigma$
satisfies the conditions of Proposition 4.1. The
algorithm is complete for such problem inputs
provided the queries $Q_{u,\Sigma}$ have no self-joins. □

By definition of the unchase process, the
complexity of cgalg in this case is $O(2^m)$, where m
is the total number of subgoals in the workload
queries in the problem input $\mathcal{I}$.

Theorem 4.3 For problem inputs $\mathcal{I}$ that
satisfy the conditions of Theorem 4.2, the decision
version of the problem of generating optimal
reformulations is in NP, provided that the queries
$unchase_{\Sigma}(Q)$ have no self-joins. □

It is remarkable that, given a problem input
$\mathcal{I}$ and the rewritings produced by cgalg on the
terminal results of unchasing the queries in $\mathcal{I}$
using the dependencies in $\mathcal{I}$, to show the
equivalence of the original workload queries to the
rewritings, we do not need to unchase the
expansions of the rewritings. (Note that one needs
to apply chase to discover rewritings that are
equivalent to queries under dependencies; see,
e.g., [DLN05]. We can show that if we used
the approach described in Section 3, we would
need to chase the rewriting expansions to show
the equivalence of the rewritings to the original
queries.)

Theorem 4.4 For problem inputs $\mathcal{I}$ that
satisfy the conditions of Theorem 4.2, let R be a
reformulation of some query Q in $\mathcal{I}$, such that
R is returned by cgalg($\{Q_{u,\Sigma}\}$, $\mathcal{D}$, $L$). Suppose
R ex P = Q u y, in the absence of dependencies. 
Then R™P = Q u .s in the absence of dependen- 
cies. □ 

5 Conclusions; Future Work 

We have presented complexity results and cgalg
algorithms for database reformulation in presence
of dependencies. Our results apply to conjunctive
queries and to the types of dependencies
that include commonly used functional dependencies,
inclusion dependencies, and foreign-key
constraints. We argued that to generate beneficial
reformulations, one can use the chase technique
for incorporating dependencies into query
definitions. At the same time, we showed that we
can reduce the complexity of database reformulation
and cover larger classes of dependencies by
incorporating into the reformulation algorithm
our unchase approach; the idea of unchase is to
remove from a query all "derived" subgoals that
would be introduced by chase.

The unchase/cgalg approach can be extended
to workloads of queries with self-joins, at the
expense of an increase in runtime complexity
(cf. [CHS02], [ACGP05]). We are currently working
on extending the approach to database reformulation
for queries with aggregation. Another
direction of our ongoing and future work is de- 
signing efficient algorithms for database reformu- 
lation for common classes of queries and depen- 
dencies. Besides database reformulation, the un- 
chase approach can be used in answering queries 
using views, as it reduces the problem of check- 
ing query containment (equivalence) in presence 
of dependencies to the problem of containment 
(equivalence) checking in the absence of depen- 
dencies, without increasing query size. Note that 
unchase, unlike chase, can be used in presence 
of cyclic inclusion dependencies. Exploring un- 
chase for answering queries using views is an- 
other direction of our future work. 